Spectral Classification Using Restricted Boltzmann 

Machine 



Fuqiang Chen* 

CO | Department of Computer Science 

College of Electronics & Information Engineering 
Tongji University 
Cao'an Highway, Jiading, Shanghai, 201804 
£>V 2012fuqiangchen@tongji.edu.cn 



o 

(N 



CO ■ Abstract 



o 

o 



O 



The spectra of a celestial body is composed of a great number of spectrum, thus 
it's necessary for us to explore the appropriate latent features to represent the ce- 
lestial body well. Restricted Boltzmann Machine (RBM) is a bipartite generative 
graphical model composed of two layers (one visible layer and one hidden layer), 
which can extract higher level features to represent the original data. Despite gen- 
erative, RBM can be used for classification when modified with free energy and 
softmax function. In this paper, to apply the binary RBM for stellar spectral clas- 
^ i sification, we first perform "threshold binaryzation" on the original data by some 

' rule (for detail, c.f. Section 4), then we resort to binary RBM to classify cata- 

clysmic variables (CVs) from non-CVs (one half of all the given data for training 
and the other half for testing). The experiment result shows state-of-the-art accu- 
racy of 100%, and it outperforms support vector machine (SVM), with accuracy 
in! of 99.14%. 

O 
CO 

' 

1 Introduction 

In this paper, we present a novel application of RBM to spectral classification, classifying CVs 
and non-CVs. We will introduce the CVs, previous work about classification of spectra from stars, 
previous application of RBM in this section. 

Cataclysmic variables (CVs) are composed of the close binaries that contain a white dwarf accreting 
material from its companion [1]. We can divide CVs into three categories: novae, dwarf novae, and 
nova-likes. These three classes differ from each other by amplitudes and timescale for variability. A 
nova can outburst because of a thermonuclear runaway creating a rise in brightness of 10-20 mag. 
Dwarf nova undergo 2-7 mag outbursts from a normal quiescent state quasi-periodically, sometimes 
short as several weeks and sometimes long to several years. A nova-like can exhibit variability of 
several magnitudes, which maybe related to changes in mass transfer leading to states of low or high 
accretion [2]. 

In short, cataclysmic variables (CVs) are binary star systems composed of a white dwarf and a 
normal star companion. They are typically small with an orbital period of 1 to 10 hours. The white 
dwarf is often called "primary" star, and the normal star as the "companion" or the "secondary" star. 
The companion star, a star that is "normal" like our Sun, loses its material onto the white dwarf via 
accretion. Besides, CVs are fairly faint in X-rays; they are just above the coronal X-ray sources and 
far below the X-ray binaries caused by how powerful their X-ray emissions are. 



*Fuqiang Chen, an undergraduate master in Tongji university. His research interests include machine learn- 
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Recently, various machine learning methods have been applied to stellar classification. Mostly, the 
researchers classify the stars based on their spectra, which is easy to get, because many countries 
have established astronomical observatories to get the spectra. Spectral characteristics can be applied 
to classify stars which contains the information about the temperature in a different way-particular 
absorption lines can be observed only for a certain range of temperatures because only in that range 
are the involved atomic energy levels populated [2] . 

By far, there have been many researches on stellar spectral classification based on various pattern 
classification methods. 

1.1 Previous work in stellar classification 

In 1998, Harinder P. Singh et al. applied principal component analysis (PCA) and artificial neural 
network (ANN) to stellar spectral classification [3] for O to M type stars, where O stars are the hottest 
and the letter sequence indicates successively cooler stars up to the coolest M class. They adopted 
PCA for dimension reduction firstly, in which they reduce the dimension to 20, with the cumulative 
percentages larger than 99.9%. Then they used multi-layer back propagation (BP) neural network 
for classification. In 2010, Rosalie C. McGurk et al. (2010) applied principal component analysis to 
SDSS stellar spectra, SDSS DR6 [4]. They found that the first 4 principal components could remain 
enough information of the original data. And their work made it easier to classify new spectra, to 
search for unusual spectra and to train various spectral classification methods. 

In 2012, Mahdi Bazarghan applied self-organizing map (SOM, a kind of unsupervised Artificial 
Neural Network (ANN) algorithm) to stellar spectral from the Jacoby, Hunter and Christian (JHC) 
library, and he obtained the accuracy of about 92.4% [5]. In 2012, S. G. Navarro et al. used artificial 
neural network to classify stellar spectra with low signal-to-noise ratio on sample of field stars along 
the line of sight toward the planetary nebulae Abell 63, NGC 6781, and NGC 7027 [6]. They trained 
and tested the ANNs with different S/N levels, founding that the insensitivity of ANNs to noise and 
the ANN's error rate to be smaller with two hidden layers and more than 20 nodes in each. 

In the above, we reviewed the application of PCA and ANN to spectral classification. Besides, S VM 
and decision trees were also researched on spectral classification by some researchers. In 2004, Y. 
Zhang et al. applied learning vector quantization (LVQ), single-layer perceptron (SLP) and support 
vector machines (SVM) for multi-wavelength data classification of AGNs (active galactic nucleus ) 
and S&G (stars and normal galaxies) [7], in which they used the histogram as the feature selection 
technique. They found that SVM's performance is better than neural network method with more 
available features. In 2006, Nicholas M. Ball et al. applied decision trees to classify star-galaxy of 
the Sloan Digital Sky Survey (SDSS) DR3 [8]. They investigated classification of all 143 million 
nonrepeat photometric objects in SDSS DR3 with 477,068 objects in SDSS spectroscopic data for 
training. And there are 361,490 galaxies, 62,333 stars, 49,545 quasars, and 3700 unknown objects in 
their training data. For classification, they classified the objects as either galaxy, star, or nsng (neither 
star nor galaxy). Based on the classification result, they associated each class with probability. 

In the above, we have reviewed the research in star spectral classification based on linear dimension 
reduction technique, PCA. By far, nonlinear dimension reduction technique has also been applied 
in star spectral classification. In 201 1, Scott F. Daniel et al. applied locally linear embedding (LLE, 
a nonlinear dimension reduction technique) to classify stellar spectra from SDSS DR7 [9]. The 
data in their experiment is chosen from the 85,564 DR7 sources classified as science-quality stellar 
spectra. They found that most of stellar spectra could be represented as one-dimensional sequence 
within three-dimensional space. And the position along the sequence was highly correlated with the 
spectral temperature. Besides, they proposed a hierarchical classification scheme based on LLE. It's 
advantage is that it requires no feature extraction. 

2 Present application of RBM 

In this section, we present some representative applications of RBM up to today. In 2007, Salakhut- 
dinov, R. et al. [10] applied RBM for collaborative filtering. In 2008, Gunawardana, A. et al. [11] 
applied RBM for cold start recommendations. In 2009, Taylor, G. W. et al. [12] applied RBM for 
modeling motion style. In 2010, Dahl G.E. et al. [13] applied RBM to phone recognition on the 
TIMIT dataset. They applied RBM to extract features of phone. In 201 1, Schluter, J. et al. [14] ap- 
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plied RBM to estimate music similarity. In 2012, Tang, Y. et al. [15] applied RBM for recognition 
and denoising. 

2.1 Our work 

In this paper, we apply RBM to classify stellar spectra of cataclysmic variables (CVs) and non-CVs 
obtained from SDSS DR7 [16]. Generally, before applying a classifier to classify the data, we always 
preprocess the original data, for example, normlization. Thus, firstly, we normalize the spectra with 
unit norm. Then, to apply binary RBM for stellar classification, we binarize the normalized spectra 
by some rule which we'll discuss in the experiment. Finally, we use RBM to classify the two kinds 
of stars, one half of all the given data for training and the other half for testing. The experiment 
result shows that the classification accuracy is 100%, which is state-of-the-art. RBM outperforms 
the popular classifier SVM (support vector machine). 

The rest of this paper is organized as follows. In section 2, we review the prerequisite for training 
restricted Boltzmann machine. In section 3, we introduce the binary RBM and the training algorithm 
for RBM. In section 4, we present the experiment result. Finally, in section 5, we conclude our work 
and present the future work. 

3 Prerequisites 

3.1 Markov Chain 

A Markov chain is a random variable sequence transiting from one state to another one, a special 
kind of stochastic process. Generally, there are a finite or countable number of possible states in the 
process or transition. It is a random memoryless process: the next state depends only on the current 
state and not on the sequence of the events that preceded it. The character of 'memorylessness' is 
called the Markov property. 

Mathematically, a Markov chain is a sequence composed of many random variables X\ , X 2 , -^3 , . . . 
with the following Markov property: 

P(X n+1 = x\X-y = Xi,X 2 =x 2 ,-..,X n = x n ) = P(X n+1 = x\X n = x n ). 

The possible values of X i7 Xi form a countable set S called the state space of the chain. Generally, 
we use transition matrix to present the probability of the transition from one state to the other one. 
The transition matrix has three properties: square; the value of a specific element is between and 
1; all the elements in each row sums to 1. If the initial vector is Xq, a row vector, the transition 
matrix T, then after n steps of inference, we can get the vector Xq ■ T n . 

Then we introduce the equilibrium of a Markov chain. If there exists some power of the transition 
matrix rendering all the elements in the resulting matrix nonzero, then we say the transition matrix 
is a regular transition matrix. Then, if the transition matrix T is regular, and there exists a unique 
vector V satisfying v ■ T n approximately equals V, for any probability vector v and large enough n. 
We call the vector V the equilibrium vector or the fixed vector of the Markov chain [17]. 

3.2 MCMC 

Markov chain Monte Carlo (MCMC) method is a kind of algorithm to sample from the probability 
distribution via constructing a Markov chain with a desired distribution (i.e., the equilibrium dis- 
tribution). The status of the chain after running many times can be regarded as a sample from the 
desired distribution. The more the step number is, the more likely the sample is sampled from the de- 
sired probability distribution. MCMC can be applied for simulation-draw fair (typical) samples from 
a probability which governs a system;integration/computing in very high dimensions;optimization 
with an annealing scheme;learning-unsupervised learning with hidden variables (simulated from 
posterior) or MLE learning of parameters [17]. 

Typical MCMC sampling can only approximate the target distribution for the residual effect of the 
starting position. More sophisticated MCMC-based algorithms (coupling from the past) can produce 
exact samples with additional computation and infinite time [17]. 
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3.3 Gibbs Sampling 



Gibbs sampling or a Gibbs sampler is an MCMC algorithm to obtain a sequence of observations ap- 
proximately from a specific multivariate probability distribution, when sampling directly is difficult. 
This sequence can be applied to approximate the joint distribution; to approximate the marginal dis- 
tribution with respect to one of the variables, or some subset of the variables. And Gibbs sampling 
is a general method for probabilistic inference. 

Gibbs sampling generates a Markov chain of samples subject to the condition of each of the sample 
correlated with nearby samples. Thus we must be careful if independent samples are desired. 



4 RBM 

Considering that RBM is a generalized version of Boltzmann Machine (BM), we first review BM. 
BM is a bipartite graphical generative model which is composed of two layers of a number of 
units with lateral connection. One layer is the visible layer v with m binary visible units Vi, i.e., 
Vi = or Vi = 1 (i = 1,2, ... , m). For each unit in this layer, the corresponding value is observable. 
The other layer is the hidden (latent) layer h with n binary hidden units hj, as in the visible layer, 
hj = or hj = 1 (j = 1,2,..., n). For the unit in the hidden layer, the corresponding value is 
hidden or latent, unobservable. Thus we can resort to probability to infer the values. 

The units from the two layers of a BM are connected with weighted edge wholly, with the weights 
Wij {v% <->• hj) . And for both the two layers, the units within each specific layer are also connected 
with each other, and also with weights. 

For BM, the energy function is defined as follows: 



E(v,h) = - ^2 VidijVj - ^2 hjdjjhj ~ y]y] VjWjjhj - y~] VjCj - hjbj, (1) 

where is the weight of the edge connecting visible units Vi and Vj, dij the weight of the edge 
connecting hidden units hi and hj, Wij the weight of the edge connecting visible unit Vi and hidden 
unit hj. The bj is the bias in the following activation function (Sigmoid function f(x) = 1/(1 + 



p(hj = l\v) = 



1 



1 i + e -b 3 -2:™ 

d the bias in the following formula: 



p(v t = l\h) 
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Then for each pair (v, h), the probability of this pair can be defined as follows: 

e -E(v,h) 

P( v ' h ) = p F » 

where PF is the partition function: 



PF = J2p^, h). 

v.h 

RBM is a graphical model with the units for both layers not connected within a specific layer, i.e., 
there are only connections between the two layers for RBM [18]. Mathematically, for RBM, a i3 = 
for i, j — 1,2, ... , to and dij — for i, j — 1, 2, . . . , n. Thus the state of units hj are independent 
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given v and so are the units Vi given h. Then we can obtain the following formula in factorization 
form: 

p(h\v) = Y[p(hj\v) md p(v\h) = J[p(vi\h). 

j i 

4.1 Contrastive Divergence 

Contrastive Divergence (CD) is proposed by Hinton [19] to train RBM. CD is based on Gibbs 
Sampling and Markov chain Monte carlo method. Initially, we are given Vi (i = 1, 2, . . . , m), then 
we can obtain hj (j = 1, 2, . . . , n) by the sigmoid function given in the above. And the value of hj 
is determined by comparing a random value r ranging from to 1 with the probability p(hj = l\v). 
Then we can reconstruct v by the sigmoid function p(vi = l\h). 

We can repeat the above process back and forward until the reconstruction error is small enough or 
it has reached the maximum iteration which is set beforehand. To update the weights and biases in 
RBM using gradient descent algorithm, we need to compute the following partial derivative firstly: 

dlogp(v,h) 

q^j- = E dat a[vihj\ - E recon [Vihj], (2) 

d\ogp(v,h) _ 

— Vi ^recon\Vi\ 7 \J) 



Oct 
d\ogp(v, h) 



dbj 



Edata [hj\ - E recon [hj] [16] , (4) 



where E[*] represent the expectation of and the subscript 'data' means that the probability is 
original-data-driven while the 'recon' means the probability is reconstructed-data-driven. 

Then we can update the weight by the following rule: 

AlVij = lliEdata^hj] - E recon [Vihj]), 

where rj is the learning rate, which influences the speed of convergence. The biases can be updated 
similarly. 

In equation (2)-(4), E da ta [•] 's are easy to compute. However, to compute or inference the latter term 
E reC on[-], we need to resort to MCMC. 

4.2 Free energy and Softmax 

To apply RBM for classification, we train a separate RBM on each class. And for classification, 
we need the free energy and softmax function for help. The free energy of a visible vector v is the 
energy that a single configuration would need to have in order to have the same probability as all 
of the configurations that contain v. We can compute the free energy [20] for a visible vector v as 
follows: 

F(«) = -J>&i- J>g(l + e*'), 

i 3 

where xj = bj + J2i v iWij- 

The log probability that the RBM trained on a specific class c assigns to the test vector v is given by: 



logp(v\c) = -F c (v) - \ogZ c , 

where Z c is the partition function of that RBM. If the total number of classes we have is small (in our 
paper it equals 2), it is easy to deal with or compute the unknown log partition functions by simply 
training a "softmax" model (on a separate training set) to predict the class from the free energies of 
all of the class specific RBMs: 
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\ogp(class = c\v) 



e -F c (v)-\og Z c 

^e-i^M-iogZd ' 



where the Z's are parameters that are learned by maximum likelihood training of the softmax func- 
tion. Here, the "softmax" function for a unit is defined as follows: 

e Xj 

P j = — k , 

__.i=l x i 

where the K means that the unit can take on K states. 



5 Experiment 
5.1 Data description 

The Sloan Digital Sky Survey (SDSS) is one of the most ambitious and influential surveys in the 
history of astronomy. Over eight years of operations (SDSS-I, 2000-2005; SDSS-II, 2005-2008), it 
obtained deep, multi-color images covering more than a quarter of the sky and created 3-dimensional 
maps containing more than 930,000 galaxies and more than 120,000 quasars [21]. Data Release 7 
(DR7) is the seventh major data release and provides images, imaging catalogs, spectra, and red- 
shifts for download. The Sloan Digital Sky Survey or SDSS is a major multi-filter imaging and 
spectroscopic redshift survey using a dedicated 2.5-m wide-angle optical telescope at Apache Point 
Observatory in New Mexico, United States. Data collection began in 2000, and the final imaging 
data release covers over 35% of the sky, with photometric observations of around 500 million ob- 
jects and spectra for more than 1 million objects. The main galaxy sample has a median redshift 
of z — 0.1; there are redshifts for luminous red galaxies as far as z = 0.7, and for quasars as far 
as z = 5; and the imaging survey has been involved in the detection of quasars beyond a redshift 
z = 60 

All of our data for experiment is from SDSS DR7. The samples is divided into two classes, one 
class composed of the normal stars (non-CVs) and the other the cataclysmic variables (CVs). There 
are totally 6818 normal stars and 208 cataclysmic variables. Both of them are composed of 3522 
variables, or rather, spectra components. Among the 6818 normal stars, there are 1,559 from Galaxy, 
3,981 Stars and 1,278 QSOs. We chose randomly half of the whole data for training and the remain 
for testing. For clarity, we show the data in the following table (Table 1). 

Table 1: The number of the original data, the number of the data for 
training and for testing, where 3522 is the dimension. 



non- 


■CVs 


CVs 


TRAIN 


TEST 


TRAIN TEST 


3414x3522 


3414x3522 


104x3522 104x3522 



5.2 Parameter chosen 

In this section, we present the parameters in our experiment. We chose all the parameters referring 
to [20] . The learning rate in the process of updating is set to be 0. 1 . The momentum for smoothness 
and to prevent over-fitting is chosen to be 0.5. The maximum number of epochs is chosen to be 50. 
The weight decay factor, penalty, is chosen to be 2x 10 -4 . The initial weight is chosen randomly 
from the standard normal distribution, while the bias is initialized with 0. For clarity, we present 
them in the following table (Table 2). 



'http://en.wikipedia.org/wiki/Sloan_Digital_Sky_Survey 



6 



Table 2: The parameters in our experiment. 



Parameter Value 

learning rate 0. 1 

momentum 0.5 

max epochs 50 

weight decay factor 2 x 10~ 4 

initial bias 



5.3 Experiment result 

We first normalize the data to have unit I2 norm, i.e., for a specific vector x = [xi,X2, ■ ■ ■ , x n ], the 
vector satisfies J^i x 1 = !■ Then we can get two matrixs, one is A =6818x3522 and the other 
B =208 x 3522. Then we find the maximum element and minimum element for C Vs and non-C Vs 
respectively. Finally, to apply binary RBM for classification, we find a parameter to decide the 
variable in our experiment is or 1, or rather, binarization. Mathematically, if 



S{i,j) - mmS(i,j) < a(maxS(i,j) - mmS(i,j)), 

then we set S(i, j) with 0, otherwise we set S(i,j) with 1. Here we use S(i,j) to denote the element 
of the matrix A and B. And the parameter a satisfies < a < 1. To investigate the parameter a, we 
first chose it to be 1/2 heuristically. Then we chose it to be 1/3. The experiment result shows that 
the classification accuracy is 100%, which is state-of-the-art and it outperforms the popular classifier 
SVM [22]. 

For clarity, we show the result in the following table (Table 3). From Table 3, we can see that the 
classification accuracy is 97.2% when a = 1/2, however, almost all of the CVs for testing is labeled 
as non-CVs. 



Table 3: The classification accuracy with different a. 

a Accuracy 

~TJl 3% 

1/3 100% 

1/2 97.2% 



6 Conclusion and future work 

Restricted Boltzmann machine is a bipartite generative graphical model which can extract features 
representing the original data well. By introducing free energy and softmax function, RBM can be 
used for classification. In this paper we apply restricted Boltzmann machine (RBM) for stellar spec- 
tral classification of normal stars (non-CVs) and cataclysmic variables (CVs). And the experiment 
result shows that the classification accuracy is 100%, which is the state-of-the-art and outperforms 
the popular classifier, SVM. 

Since RBM is the building block as deep Boltzmann machine and deep belief nets, then we can 
infer that deep Boltzmann machine [23] and deep belief net can also perform well on stellar spectral 
classification, which is our future work. 
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